I'm going to do a very simple analysis of Baltimore crime to show off R. We'll use data downloaded from Baltimore City's awesome open data site.
Let's load the data:
arrest_tab = read.csv("BPD_Arrests.csv", stringsAsFactors = FALSE)
cctv_tab = read.csv("CCTV_Locations.csv", stringsAsFactors = FALSE)
# these columns are mislabeled, so fix them
tmp = arrest_tab$sex
arrest_tab$sex = arrest_tab$race
arrest_tab$race = tmp
# dimension of table (data.frame)
dim(arrest_tab)
## [1] 104528 15
# what are the columns
names(arrest_tab)
## [1] "arrest" "age" "sex"
## [4] "race" "arrestDate" "arrestTime"
## [7] "arrestLocation" "incidentOffense" "incidentLocation"
## [10] "charge" "chargeDescription" "district"
## [13] "post" "neighborhood" "Location.1"
# what is the average arrest age?
mean(arrest_tab$age)
## [1] 33.2
# the range of arrest ages
range(arrest_tab$age)
## [1] 0 87
# how many arrests per sex
table(arrest_tab$sex)
##
## F M
## 2 19431 85095
# what are the most common offenses
head(sort(table(arrest_tab$incidentOffense), decreasing = TRUE))
##
## Unknown Offense 87-Narcotics 4E-Common Assault
## 38649 24744 6739
## 87O-Narcotics (Outside) 97-Search & Seizure 79-Other
## 6515 3670 3461
# range of arrests after removing those w/ age==0
range(arrest_tab$age[arrest_tab$age > 0])
## [1] 8 87
Let's see a table of arrests by sex and race
table(sex = arrest_tab$sex, race = arrest_tab$race)
## race
## sex A B H I U W
## 2 0 0 0 0 0 0
## F 0 37 14663 0 34 183 4514
## M 0 205 72605 1 184 1566 10534
A histogram of age
hist(arrest_tab$age, nc = 100)
<<<<<<< HEAD
with(arrest_tab, hist(age[sex == "M"], nc = 100)) # males only
<<<<<<< HEAD
with(arrest_tab, hist(age[sex == "F"], nc = 100)) # females only
<<<<<<< HEAD
Let's take a look at how age depends on sex. Let's plot age as a function of sex first (notice how we indicate that sex is a factor).
plot(arrest_tab$age ~ factor(arrest_tab$sex))
<<<<<<< HEAD
One of the neat things about R is that statistical model building and testing is built-in. The model we use is \( y_i=\beta_0+\beta_1 x_i \) where \( y_i \) is age of sample (example) \( i \) and \( x_i \) is an indicator variable \( x_i \in \{0,1\} \) with \( x_i=1 \) if the \( i \)-th record (example) is male. You can check that \( \beta_1 \) is the difference in mean age between females and males. We use the formula syntax to build a linear regression model.
# let's ignore those records with missing sex
fit = lm(age ~ factor(sex), data = arrest_tab, subset = arrest_tab$sex %in%
c("M", "F"))
summary(fit)
##
## Call:
## lm(formula = age ~ factor(sex), data = arrest_tab, subset = arrest_tab$sex %in%
## c("M", "F"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.39 -10.15 -3.15 9.61 53.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.3878 0.0847 394.1 <2e-16 ***
## factor(sex)M -0.2343 0.0939 -2.5 0.013 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.8 on 104524 degrees of freedom
## Multiple R-squared: 5.96e-05, Adjusted R-squared: 5e-05
## F-statistic: 6.23 on 1 and 104524 DF, p-value: 0.0126
We see that \( \beta_1 \approx -0.2 \) meaning that the arrest age for males is about 2.5 months younger. So there is very little difference in the average age (which is what the linear model is testing) but we see that the probability of observing this difference from a sample of this size when there is no difference in average age is small \( p \approx 0.01 \). Since we have a very large number of examples, or records, this testing framework will declare very small differences as statistically significant. We'll return to this theme later in class.
First we need to extract latitude and longitude from location, we'll use some string functions to do this
tmp = gsub("\\)", "", gsub("\\(", "", arrest_tab$Location))
## Warning: Name partially matched in data frame
tmp = strsplit(tmp, split = ",")
arrest_tab$lon = as.numeric(sapply(tmp, function(x) x[2]))
arrest_tab$lat = as.numeric(sapply(tmp, function(x) x[1]))
Now let's plot
plot(arrest_tab$lon, arrest_tab$lat, xlab = "Longitude", ylab = "Latitude",
main = "Arrests in Baltimore")
<<<<<<< HEAD
We can also use density estimates to make this nicer:
smoothScatter(arrest_tab$lat, arrest_tab$lon, xlab = "Latitude", ylab = "Longitude",
main = "Arrests in Baltimore")
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
<<<<<<< HEAD
Let's make this fancier using the ggplot2 graphics systems and the maps package containing map data.
library(maps)
library(ggplot2)
balto_map = subset(map_data("county", region = "maryland"), subregion == "baltimore city")
plt = ggplot()
plt = plt + geom_polygon(data = balto_map, aes(x = long, y = lat), color = "white",
fill = "gray40")
plt = plt + geom_point(data = arrest_tab, aes(x = lon, y = lat), color = "blue",
alpha = 0.1)
print(plt)
## Warning: Removed 40636 rows containing missing values (geom_point).
<<<<<<< HEAD
Now let's add CCTV cameras.
tmp = gsub("\\)", "", gsub("\\(", "", cctv_tab$Location))
## Warning: Name partially matched in data frame
tmp = strsplit(tmp, split = ",")
cctv_tab$lon = as.numeric(sapply(tmp, function(x) x[2]))
cctv_tab$lat = as.numeric(sapply(tmp, function(x) x[1]))
plt = ggplot()
plt = plt + geom_polygon(data = balto_map, aes(x = long, y = lat), color = "white",
fill = "gray40")
plt = plt + geom_point(data = arrest_tab, aes(x = lon, y = lat), color = "blue",
alpha = 0.1)
plt = plt + geom_point(data = cctv_tab, aes(x = lon, y = lat), color = "red")
print(plt)
## Warning: Removed 40636 rows containing missing values (geom_point).
<<<<<<< HEAD
Is there any relationship between the number of CCTV cameras and the number of arrests? Divide the city into a grid and plot the number of CCTV cameras vs. the number of arrests.
latRange = range(arrest_tab$lat, na.rm = TRUE)
lonRange = range(arrest_tab$lon, na.rm = TRUE)
latGrid = seq(min(latRange), max(latRange), len = 50)
lonGrid = seq(min(lonRange), max(lonRange), len = 50)
latFac = as.numeric(cut(arrest_tab$lat, breaks = latGrid))
lonFac = as.numeric(cut(arrest_tab$lon, breaks = lonGrid))
gridFac = (latFac - 1) * length(latGrid) + (lonFac - 1)
latFac = as.numeric(cut(cctv_tab$lat, breaks = latGrid))
lonFac = as.numeric(cut(cctv_tab$lon, breaks = lonGrid))
cctvGridFac = (latFac - 1) * length(latGrid) + (lonFac - 1)
arrestTab = table(gridFac)
cctvTab = table(cctvGridFac)
m = match(names(cctvTab), names(arrestTab))
plot(arrestTab[m] ~ factor(cctvTab))
<<<<<<< HEAD
As part of HW1 you will add to this analysis. Please use the following template:
Eric Krokos
What question are you asking?: What does the distribution of arrests look like for Asians(Blue), Blacks(Red), Whites(Green), and Unknown (Orange)?
What is the code you use to answer it?:
asian = arrest_tab[arrest_tab$race == "A", ]
black = arrest_tab[arrest_tab$race == "B", ]
white = arrest_tab[arrest_tab$race == "W", ]
unknown = arrest_tab[arrest_tab$race == "U", ]
library(maps)
library(ggplot2)
balto_map = subset(map_data("county", region = "maryland"), subregion == "baltimore city")
plt = ggplot()
plt = plt + geom_polygon(data = balto_map, aes(x = long, y = lat), color = "white",
fill = "gray40")
plt = plt + geom_point(data = asian, aes(x = lon, y = lat), color = "blue",
alpha = 0.1)
plt = plt + geom_point(data = black, aes(x = lon, y = lat), color = "red", alpha = 0.1)
plt = plt + geom_point(data = white, aes(x = lon, y = lat), color = "green",
alpha = 0.1)
plt = plt + geom_point(data = unknown, aes(x = lon, y = lat), color = "orange",
alpha = 0.1)
print(plt)
## Warning: Removed 93 rows containing missing values (geom_point).
## Warning: Removed 33454 rows containing missing values (geom_point).
## Warning: Removed 6251 rows containing missing values (geom_point).
## Warning: Removed 749 rows containing missing values (geom_point).
<<<<<<< HEAD
The red (Black) dominate the map, but it's interesting that most crimes commited by Whites are done on the southern most part of the city, with Black crime occuring primarily in the upper part. It's difficult to see Asian commited crime. Given that most of the orange dots are in the southern part of the city, I would guess that most of the unknown races are from White people.
What question are you asking?: In which months do arrests occur the most? On what day of the week do arrests occur the most?
What is the code you use to answer it?:
arrestDates = as.Date(arrest_tab$arrestDate, "%m/%d/%Y")
dwka <- format(arrestDates, "%a")
# dwka
dwkn <- as.numeric(format(arrestDates, "%w")) # numeric version
hist(dwkn, main = "Arrests by Day of Week", xlab = "Day of Week", ylab = "Arrest Frequency",
breaks = -0.5 + 0:7, labels = unique(dwka[order(dwkn)]))
<<<<<<< HEAD
dwkm <- format(arrestDates, "%b") # get month labels
dwkmn <- as.numeric(format(arrestDates, "%m")) # get numeric data for histogram
hist(dwkmn, main = "Arrests by Month", xlab = "Month of Year", ylab = "Arrest Frequency",
breaks = 0:12, labels = unique(dwkm[order(dwkmn)]))
<<<<<<< HEAD
At the month level, we observe that there is a decline in arrest rates during the winter months and that arrests peak in March and August.
At the day level, we observe that most arrests occur in the middle of the week and that significantly fewer arrests occur on the weekends.
Is there a relationship between the age of the arrested individual and the time they are arrested?
# change the time format from hh:mm to hh.mm so it can be converted to a
# numeric value
tmp = gsub(":", ".", arrest_tab$arrestTime)
# convert the arrestTime to numeric
arrestTime = as.numeric(tmp)
# group ages into 5 buckets
ages = arrest_tab$age
ages = cut(ages, 5)
# plot the age buckets as a factor of the numeric arrest times
plot(arrestTime ~ factor(ages))
<<<<<<< HEAD
<<<<<<< HEAD
<<<<<<< HEAD
From these plots we can see slight differences in the average times of day members of these age groups are arrested. We can observe that as age increases, generally the time of arrest is earlier in the day though this is only a very slight difference. For all groups the average time of arrest is very close to 15:00.
What question are you asking?: Is there a relationship between the time a person is arrested and their sex?
What is the code you use to answer it?:
# Make a copy of arrest_tab since we need to modify
arrest_tab_skb <- arrest_tab
# Filter out arrests with unknown sex
arrest_tab_skb <- arrest_tab_skb[arrest_tab_skb$sex == "M" | arrest_tab_skb$sex ==
"F", ]
# Convert time to just hours as numeric
arrest_tab_skb["arrestTime"] <- lapply(arrest_tab_skb["arrestTime"], function(x) as.numeric(strftime(strptime(x,
format = "%H:%M"), "%H")))
# Plot male arrests by hour
hist(arrest_tab_skb[arrest_tab_skb$sex == "M", ]$arrestTime, breaks = 24, main = "Hour of Arrest Frequency for Males",
xlab = "Hour of the Day", col = "lightblue", xlim = range(0:23), xaxt = "n")
axis(side = 1, at = seq(0, 23), labels = seq(0, 23))
<<<<<<< HEAD
# Plot female arrests by hour
hist(arrest_tab_skb[arrest_tab_skb$sex == "F", ]$arrestTime, breaks = 24, main = "Hour of Arrest Frequency for Females",
xlab = "Hour of the Day", col = "pink", xlim = range(0:23), xaxt = "n")
axis(side = 1, at = seq(0, 23), labels = seq(0, 23))
<<<<<<< HEAD
# Plot relationship between hour of arrest and sex
plot(arrest_tab_skb$arrestTime ~ factor(arrest_tab_skb$sex), main = "Relationship Between Hour of Arrest and Sex",
xlab = "Sex", ylab = "Hour of Arrest")
<<<<<<< HEAD
What did you observe?: From the two histograms it isn't readily apparent that there is any relationship between the time a person is arrested and their sex. However, from the third plot we can see that males tend to be arrested slightly later in the day when compared to females.
What question are you asking?:
Do violent crimes occur more in certain areas?
What is the code you use to answer it?:
# The FBI classifies the following as violent crime: - murder and
# nonnegligent manslaughter - forcible rape - robbery - aggravated assault
# as per
# http://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2011/crime-in-the-u.s.-2011/violent-crime/violent-crime
violent = c("1A-Murder", "2A-Rape (Force)", "2B-Rape (Attempt)", "3AF-Robb Hwy-Firearm",
"3AJF-Robb Carjack-Firearm", "3AJK-Robb Carjack-Knife", "3AK-Robb Hwy-Knife",
"3AO-Robb Hwy-Other Wpn", "3CF-Robb Comm-Firearm", "3CK-Robb Comm-Knife",
"3CO-Robb Comm-Other Wpn", "3EF-Robb Gas Station-Firearm", "3EK-Robb Gas Station-Knife",
"3GF-Robb Conv Store-Firearm", "3GK-Robb Conv Store-Knife", "3GO-Robb Conv Store-Other Wpn",
"3JF-Robb Residence-Firearm", "3JK-Robb Residence-Knife", "3JO-Robb Residence-Other Wpn",
"3LF-Robb Bank-Firearm", "3FO-RObb Bank-Other Wpn", "3NF-Robb Misc-Firearm",
"3NK-Robb Misc-Knife", "3NO-Robb Misc-Other Wpn", "4A-Agg. Asslt.- Gun",
"4B-Agg. Asslt.- Cut", "4C-Agg. Asslt.- Oth.", "4D-Agg. Asslt.- Hand")
plt = ggplot()
plt = plt + geom_polygon(data = balto_map, aes(x = long, y = lat), color = "white",
fill = "gray40")
plt = plt + geom_point(data = arrest_tab, aes(x = lon, y = lat), color = "blue",
alpha = 0.1)
plt = plt + geom_point(data = arrest_tab[arrest_tab$incidentOffense %in% violent,
], aes(x = lon, y = lat), color = "red", alpha = 0.2)
print(plt)
## Warning: Removed 40636 rows containing missing values (geom_point).
## Warning: Removed 997 rows containing missing values (geom_point).
<<<<<<< HEAD
What did you observe?
It appears that there is a heavier concentration of violent crimes in the inner city, particularly in a few hotspots (of which I am unaware the significance of). There also appear to be a couple of specific outlying locations where an unusually large number of violent crimes occur.
What question are you asking?
Is there any difference between minors and adults in terms of time and location? (we suppose people under 21 are minors)
What is the code you use to answer it?
We split data into two parts (i.e. minors (age < 21) and adults (age >= 21)), plot the distribution of arrest time and draw scatters according to the location for both minors and adults.
# copy data frame, filter out invalid data (age = 0)
arrestTmp <- subset(arrest_tab, arrest_tab$age != 0)
# extract geo info
tmp <- gsub("\\).*", "", gsub(".*\\(", "", arrestTmp$Location))
## Warning: Name partially matched in data frame
tmp = strsplit(tmp, split = ",")
arrestTmp$lat <- as.numeric(sapply(tmp, function(x) x[2]))
arrestTmp$lon <- as.numeric(sapply(tmp, function(x) x[1]))
# suppress warnings
arrestTmp <- subset(arrestTmp, !is.na(lon) & !is.na(lat))
# tag data with 'minors' and 'adults' according to their ages
arrestTmp$biAge <- sapply(arrestTmp$age, function(x) {
if (x < 21) {
return("minors")
} else {
return("adults")
}
})
# new col: time -> time in Hour
arrestTmp$arrestTimeH <- as.integer(gsub(":[0-9][0-9]", "", arrestTmp$arrestTime))
# plot minors' distribution
minors_tab <- subset(arrestTmp, arrestTmp$biAge == "minors")
minorsArray <- minors_tab$arrestTimeH
minorsTable <- as.integer(table(factor(minorsArray)))
barplot(minorsTable, space = 0, width = 1, xlab = "Time", ylab = "Number of arrset",
main = "Number of arrest in each hour (minors)", axes = T)
axis(side = 1, at = seq(0, 23), labels = seq(0, 23))
<<<<<<< HEAD
# plot adults' distribution
adults_tab <- subset(arrestTmp, arrestTmp$biAge == "adults")
adultsArray <- adults_tab$arrestTimeH
adultsTable <- as.integer(table(factor((adultsArray))))
barplot(adultsTable, space = 0, width = 1, xlab = "Time", ylab = "Number of arrset",
main = "Number of arrest in each hour (adults)", axes = T)
axis(side = 1, at = seq(0, 23), labels = seq(0, 23))
<<<<<<< HEAD
plot(arrestTmp$arrestTimeH ~ factor(arrestTmp$biAge), main = "Relationship Between time of Arrest and minor or not",
xlab = "age", ylab = "Hour of Arrest")
<<<<<<< HEAD
# visualize arrest location for minors & adults
smoothScatter(minors_tab$lat, minors_tab$lon, xlab = "Latitude", ylab = "Longitude",
main = "Arrests in Baltimore (minors)")
<<<<<<< HEAD
smoothScatter(adults_tab$lat, adults_tab$lon, xlab = "Latitude", ylab = "Longitude",
main = "Arrests in Baltimore (adults)")
<<<<<<< HEAD
# cleanup
rm(arrestTmp, tmp)
rm(adults_tab, adultsTable, adultsArray)
rm(minors_tab, minorsTable, minorsArray)
What did you observe?
What question are you asking?: What is the most common incident offense, and how does it related to sex, race and district?
What is the code you use to answer it?:
old_par <- par(no.readonly = TRUE)
# first let's look at the bar plot of the distribution of incident offense
# types as there are many # of inccident offsense types, I use a filter to
# get rid of 'unknown offense' that we don't care about, and the ones with
# less than 200 cases in order to make the plot clear to understand
library(sqldf)
<<<<<<< HEAD
<<<<<<< HEAD
## Loading required package: RSQLite
=======
=======
>>>>>>> 0e04f064bb355d87894fd2913c6d15a069b0daf6
## Loading required package: gsubfn
## Loading required package: proto
## Loading required namespace: tcltk
## Loading required package: RSQLite
<<<<<<< HEAD
>>>>>>> f520a6901b5f9c89f7fa77a941026b208c6bfe34
=======
>>>>>>> 0e04f064bb355d87894fd2913c6d15a069b0daf6
## Loading required package: DBI
## Loading required package: RSQLite.extfuns
a = sqldf("select incidentOffense, count(*) as cnt from arrest_tab where incidentOffense != 'Unknown Offense' group by incidentOffense having cnt >= 200 order by cnt")
<<<<<<< HEAD
par(las = 2, mar = c(5, 7, 4, 2))
barplot(a$cnt, horiz = TRUE, cex.names = 0.7, names.arg = a$incidentOffense)
=======
## Loading required package: tcltk
par(las = 2, mar = c(5, 7, 4, 2))
barplot(a$cnt, horiz = TRUE, cex.names = 0.7, names.arg = a$incidentOffense)
<<<<<<< HEAD
# as you can see, the most common incident offense is 'Narcotics' (the 1st
# and 3rd row) next, let's analyze the correlations between 'Narcotics'
# inccident type and sex, race and district this is a correlation analysis
# invovling in multiple categorical variables I use a mosaic plot instead of
# the correlogram plot to show their correlations in one plot
library(vcd)
## Loading required package: grid
filtered <- subset(arrest_tab, (race == "B" | race == "W") & sex != "" & district !=
"" & (incidentOffense == "87-Narcotics" | incidentOffense == "87O-Narcotics (Outside)"))
filtered$district <- gsub("NORTHERN", "N", gsub("NORTHEASTERN", "NE", gsub("NORTHWESTERN",
"NW", filtered$district)))
filtered$district <- gsub("SOUTHERN", "S", gsub("SOUTHEASTERN", "SE", gsub("SOUTHWESTERN",
"SW", filtered$district)))
filtered$district <- gsub("WESTERN", "W", gsub("EASTERN", "E", gsub("CENTRAL",
"C", filtered$district)))
mosaic(~district + sex + race + incidentOffense, data = filtered, shade = TRUE,
legend = TRUE)
<<<<<<< HEAD
par(old_par)
What did you observe?: The first bar plot shows that the 'Narcotics' is the most common inccident offense in this city. Interestingly 'Narcotics (Outside)' is another top3 inccident offense type.
In the second plot, I use mosaic plot to visualize the correlations among multiple categorical variables (race (b, w), sex (m, f), district (w, sw, se, s, nw, ne, n, e, c)). Rectangle sizes reflect the frequentices of a group, the colors shows the pearson residuals (blue means above expected value, while red means it's below expected value of fitted model).
Here're some interesting findings:
From previous map plots with race info, we know that in the southern regions, white cases and black cases are similar. It is the same case for Narcotics. The south regions (SE, S), the numbers are similar (we can see from the heighs of the rectangle). In other regions, black dominants white.
Where you can find narcotics cases? And where does outside Narcotics case happen more often than indoor? North (N), Southerne East (SE) and South (S) regions have the fewest instances of Narcotics cases. In the north eastern (NE) and south western (SW) regions, the Narcotics outside cases are above expected value, while Eastern (E), Northwestern (NW) and Western (W) regions outside cases are below expected value.
Female's narcotics active region is quite different from male. In the central district ©, both female black and white are above the expected values for both inside and outside narcotics cases, while male cases regardless races are below expected value.
What question are you asking?:
What neighborhoods have the highest number of arrests? Of narcotics arrests? What is the nature of the crime in the worst neighborhoods?
What is the code you use to answer it?:
# We begin by counting the total number of arrest records per neighborhood,
# throwing out those where the neighborhood is empty.
total <- table(arrest_tab$neighborhood[arrest_tab$neighborhood != ""])
# Now we do it again, using only narcotics arrests.
narc.arrests <- arrest_tab$neighborhood[grep("narcotics", arrest_tab$incidentOffense,
ignore.case = TRUE)]
narc <- table(narc.arrests[narc.arrests != ""])
# We want to put these vectors into the same data frame, so let's throw away
# neighborhoods with no narcotics arrests.
total <- total[names(total) %in% names(narc)]
df <- data.frame(total)
df$narc <- narc
# Now we can sort the frame by total number of arrests.
df <- df[order(df$total, df$narc), ]
df$nonnarc <- df$total - df$narc
# We'll get rid of extra variables; everything we need is in the frame.
rm(total, narc, narc.arrests)
# And generate a plot of neighborhoods with the most arrests, along with how
# many of them are for narcotics. We looked at Hui Miao's code to find out
# how to save and restore graph parameters.
old_par <- par(no.readonly = TRUE)
par(las = 2, mar = c(4, 9, 3, 2) + 0.1)
barplot(t(as.matrix(df[df$total > 600, 2:3])), horiz = TRUE, col = c("orange",
"blue"), cex.names = 0.7, main = "Number of Arrests per Neighborhood--")
legend(2000, 12, legend = c("narcotics", "non-narcotics"), fill = c("orange",
"blue"), cex = 0.7)
<<<<<<< HEAD
par(old_par)
# Using the code above it, we will now create graphics for the 5
# neighborhoods with the most crime: Downtown, Sandtown-Winchester, Central
# Park Heights, Broadway East, Belair-Edison
library(plotrix)
bmoreHoodAnalyze <- function(arg1) {
# This function cleans the Baltimore dataset by combining similar crimes and
# and graphically depicts the most frequently crimes in an area
a <- table(arrest_tab$incidentOffense[arrest_tab$neighborhood == arg1])
df <- data.frame(a)
# Combine sums Narcotics
narcoSum <- sum(df$Freq[df$Var1 == "87O-Narcotics (Outside)" | df$Var1 ==
"87-Narcotics"])
df$Freq[df$Var1 == "87-Narcotics"] <- narcoSum
df <- df[!df$Var1 == "87O-Narcotics (Outside)", ]
## Unlabeled crimes
unkSum <- sum(df$Freq[df$Var1 == "Unknown Offense" | df$Var1 == "79-Other"])
df$Freq[df$Var1 == "Unknown Offense"] <- unkSum
df <- df[!df$Var1 == "79-Other", ]
df <- df[!df$Var1 == "Unknown Offense", ]
# Concatenate the remainder of the crimes
df <- df[order(-df$Freq), ]
l <- as.vector(df[, 1])[0:5]
freq <- as.vector(df[, 2])[0:5]
freq <- c(freq, sum(as.vector(df[, 2])[6:dim(df)[1]]) + unkSum)
l <- c(l, "Other")
title <- paste0("Distribution of Type of Crime in ", arg1)
pie3D(freq, labels = l, explode = 0.2, theta = pi/2.5, labelcex = 0.7, main = title)
}
bmoreHoodAnalyze("Downtown")
<<<<<<< HEAD
<<<<<<< HEAD
## Error: could not find function "pie3D"
=======
bmoreHoodAnalyze("Sandtown-Winchester")
<<<<<<< HEAD
<<<<<<< HEAD
## Error: could not find function "pie3D"
=======
bmoreHoodAnalyze("Central Park Heights")
<<<<<<< HEAD
<<<<<<< HEAD
## Error: could not find function "pie3D"
=======
bmoreHoodAnalyze("Broadway East")
<<<<<<< HEAD
<<<<<<< HEAD
## Error: could not find function "pie3D"
=======
bmoreHoodAnalyze("Belair-Edison")
<<<<<<< HEAD
<<<<<<< HEAD
## Error: could not find function "pie3D"
=======
What did you observe?:
It is clear from this analysis that narcotics related crimes are the most frequent ones across many neighborhoods in Baltimore.
Unfortunately, the Baltimore database left many entries as “79-Other” or unknown, so specific crimes are hard to correlate. However, with some cleaning of the data-set, a clear pattern emerged in these problem spots. After narcotics, the most frequent crime to be encountered would be assault or search/seizure related, followed by towed vehicles.
What question are you asking?:
1. As presented earlier by another team, arrests rate on the weekend is not as high as in the middle of the week, although it was expected for crimes to be a bit more in the weekend. So Why is that?
2. How is the number of vacant houses correlated with the crime rate in a neighborhood? We are concerned with two types of crimes: 1) Violence-related. 2) Narcotic-related. We obtained a dataset for vacant buildings in Baltimore and used it to plot their locations vs. the geographical distribution of each crime category.
What is the code you use to answer it?:
Test1: Crimes on the weekend.
# Define two types of crimes: Violence-Related, Narcotic-Related
narcotics <- c("87-Narcotics", "87O-Narcotics (Outside)", "97-Search & Seizure")
violentCrimes <- c("102-Questional Death", "104-Malicious Burning", "105-Suspicious Burning",
"115-Trespassing", "1A-Murder", "2A-Rape (Force)", "2B-Rape (Attempt)",
"2D-Statutory Rape", "2G-Sodomy/Perverson", "2J-Other Sex Offn.", "39-Fire",
"3AF-Robb Hwy-Firearm", "3AJF-Robb Carjack-Firearm", "3AJK-Robb Carjack-Knife",
"3AK-Robb Hwy-Knife", "3AO-Robb Hwy-Other Wpn", "3B-Robb Highway (Ua)",
"3BJ-Robb Carjack(Ua)", "3CF-Robb Comm-Firearm", "3CK-Robb Comm-Knife",
"3CO-Robb Comm-Other Wpn", "3D-Robb Comm. (Ua)", "3EF-Robb Gas Station-Firearm",
"3EK-Robb Gas Station-Knife", "3F-Robb Gas Sta. (Ua)", "3GF-Robb Conv Store-Firearm",
"3GK-Robb Conv Store-Knife", "3GO-Robb Conv Store-Other Wpn", "3H-Robb Conv. Stor.(Ua)",
"3JF-Robb Residence-Firearm", "3JK-Robb Residence-Knife", "3JO-Robb Residence-Other Wpn",
"3K-Robb Res. (Ua)", "3LF-Robb Bank-Firearm", "3FO-RObb Bank-Other Wpn",
"3M-Robb Bank (Ua)", "3NF-Robb Misc-Firearm", "3NK-Robb Misc-Knife", "3NO-Robb Misc-Other Wpn",
"3P-Robb Misc. (Ua)", "49-Family Disturbance", "4A-Agg. Asslt.- Gun", "4B-Agg. Asslt.- Cut",
"4C-Agg. Asslt.- Oth.", "4D-Agg. Asslt.- Hand", "4E-Common Assault", "4F-Assault By Threat",
"52A-Animal Cruelty", "56-Missing Person", "75-Destruct. Of Property", "76-Child Abuse-Sexual",
"7A-Stolen Auto")
# Arrest data for the two types of crimes, and crimes with unknown offense
violent_arrests <- arrest_tab[arrest_tab$incidentOffense %in% violentCrimes,
]
narcotic_arrests <- arrest_tab[arrest_tab$incidentOffense %in% narcotics, ]
unknown_arrests <- arrest_tab[arrest_tab$incidentOffense == "Unknown Offense",
]
# Histogram function for crimes over the days of week
hist_dayofweek <- function(dataDates, dataName) {
dates = as.Date(dataDates, "%m/%d/%Y")
dayofweek <- format(dates, "%a")
dayofweek_nums <- as.numeric(format(dates, "%w")) # numeric version
hist(dayofweek_nums, main = paste(dataName, " by Day of Week"), xlab = "Day of Week",
ylab = "Frequency", breaks = -0.5 + 0:7, labels = unique(dayofweek[order(dayofweek_nums)]))
}
# Plot histogram of violence-related arrests over the week
hist_dayofweek(violent_arrests$arrestDate, "Violence-Related Arrests")
<<<<<<< HEAD
# Plot histogram of narcotic-related arrests over the week
hist_dayofweek(narcotic_arrests$arrestDate, "Narcotic-Related Arrests")
<<<<<<< HEAD
# Plot histogram of unknown-offense arrests over the week
hist_dayofweek(unknown_arrests$arrestDate, "Unknown Offense Arrests")
<<<<<<< HEAD
Test2: Vacant buildings vs. Crime areas.
# Prepare the dataset of vacant buildings
vacant_tab <- read.csv("Vacant_Buildings.csv", stringsAsFactors = FALSE)
# prepare long/lat comlumns
tmp = gsub("\\)", "", gsub("\\(", "", vacant_tab$Location))
## Warning: Name partially matched in data frame
tmp = strsplit(tmp, split = ",")
vacant_tab$lon = as.numeric(sapply(tmp, function(x) x[2]))
vacant_tab$lat = as.numeric(sapply(tmp, function(x) x[1]))
# Plot the geographical distribution of vacant buildings vs. arrests
library(ggplot2)
library(ggmap)
# Function to plot datapoints using GoogleMaps API
plot_map <- function(map, dataPoints, dataPoints2) {
googleMap = ggmap(map)
googleMap = googleMap + geom_point(data = dataPoints, aes(x = lon, y = lat),
color = "black", alpha = 0.1, size = 3)
if (!missing(dataPoints2)) {
googleMap = googleMap + geom_point(data = dataPoints2, aes(x = lon,
y = lat), color = "yellow", alpha = 0.1, size = 1.5)
}
googleMap
}
# Get Baltimore city map from GoogleMaps
map = get_map(location = c(lon = -76.62, lat = 39.3), zoom = 12, maptype = "terrain")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=39.3,-76.62&zoom=12&size=%20640x640&scale=%202&maptype=terrain&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
# Plot violence-related arrests vs. vacant buildings locations
plot_map(map, violent_arrests, vacant_tab)
## Warning: Removed 2588 rows containing missing values (geom_point).
# Plot narcotic-related arrests vs. vacant buildings locations
plot_map(map, narcotic_arrests, vacant_tab)
## Warning: Removed 2744 rows containing missing values (geom_point).
What did you observe?:
1. For Q1:
- As observed from the figures, violence-related arrests are high in weekends (with a peak on Saturday) as anticipated.
- Both narcotic-related arrests and arrests with unknown-offense have their peak in the middle of the week, with a lower rate on weekends.
- As the number of records for narcotic-related and unknown offense arrests are much higher than violence-related arrests, the overall total crime rate has its peak on Wednesday, not in weekends.
2. For Q2:
As predicted, the figures show that the vacant buildings in a neighborhood are correlated with its crime rate, with a bit more correlation with neighborhoods where arrests due to violence occurred than the narcotic-related arrests.
What question are you asking?:
Q1. Is there a relationship between arrest time and crime type?
Q2. Is there a relationship between arrest time and arrest location (longitude, latitude)?
What is the code you use to answer it?:
library(ggmap)
library(ggplot2)
# Define function that converts time to numerical value for calculation
time2num = function(x) {
tmp = as.numeric(gsub("\\:", "", x))
x = tmp/100 + tmp%%100/60
}
# Get subtable from the original table
arrest_tab_tmp1 = subset(arrest_tab, incidentOffense != "", select = c(arrestTime,
incidentOffense))
arrest_tab_tmp1 = arrest_tab_tmp1[complete.cases(arrest_tab_tmp1), ]
arrest_tab_tmp1$arrestTime = sapply(arrest_tab_tmp1$arrestTime, time2num)
# Plot the the relationship between arrest time and crime type, reorder them
# based on the median of arrest time value
arrest_tab_tmp1$incidentOffense = with(arrest_tab_tmp1, reorder(incidentOffense,
arrestTime, median))
qplot(factor(arrest_tab_tmp1$incidentOffense), arrest_tab_tmp1$arrestTime, main = "Relationship Between Arrest Time and Crime Type",
xlab = "Crime Type", ylab = "Arrest Time", geom = "boxplot", asp = 2) +
coord_flip()
<<<<<<< HEAD
# Get subtable from the original table
arrest_tab_tmp2 = subset(arrest_tab, select = c(arrestTime, lon, lat))
arrest_tab_tmp2 = arrest_tab_tmp2[complete.cases(arrest_tab_tmp2), ]
arrest_tab_tmp2$arrestTime = sapply(arrest_tab_tmp2$arrestTime, time2num)
# Shift time axis to start at 6am for better representation
shift_axis_shift = function(x) {
if (x < 6) {
x = x + 24
} else {
x = x
}
}
arrest_tab_tmp2$arrestTime = sapply(arrest_tab_tmp2$arrestTime, shift_axis_shift)
# Cut number into bins for color visualization
rg_pal = colorRampPalette(c(c("light green", "yellow", "red")), bias = 1)
a_color = rg_pal(12)[as.numeric(cut(arrest_tab_tmp2$arrestTime, breaks = 12))]
# Get Baltemore map from Google Map
map = get_map(location = c(lon = -76.62, lat = 39.3), zoom = 12, maptype = "roadmap")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=39.3,-76.62&zoom=12&size=%20640x640&scale=%202&maptype=roadmap&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
plt = ggmap(map)
# Visualize arrest time on map
plt = plt + geom_point(data = arrest_tab_tmp2, aes(x = arrest_tab_tmp2$lon,
y = arrest_tab_tmp2$lat), color = a_color, alpha = 0.1)
plt = plt + guides(title = "Arrest Time", fill = guide_colorbar())
print(plt)
## Warning: Removed 61 rows containing missing values (geom_point).
<<<<<<< HEAD
<<<<<<< HEAD
<<<<<<< HEAD
What did you observe?
Q1: As expected, there is a obvious relationship between arrest time and crime type. From the figure, we can conclude the tendency for specific crime. For example, rob stores or gas stations usually happen in the late night. However, Rob banks usually happen in the afternoon.
Q2: It seems that there is a relationship between arrest time and arrest location. In places near the downtown, crimes trend to happen in night. In places far from downtown, crimes trend to happen in daytime. But this relationship still needs to be further analyzed commbing with specific geographic features.
What question are you asking?:
Is there a relationship between top crime types and age of the criminals? For instance, the younger criminals tend to commit more aggressive offence while the older tend to commit more stealthy offence.
What is the code you use to answer it?:
# Remove the data with unknown offense or other type
arrestData <- subset(arrest_tab, age > 0 & incidentOffense != "Unknown Offense" &
incidentOffense != "79-Other")
arrestData <- subset(arrestData, select = c(age, incidentOffense))
# Calculate the frequency of each crime type
offenseFreq <- table(arrestData$incidentOffense)
offenseFreq <- sort(offenseFreq, decreasing = TRUE)
N <- 15
topN <- data.frame(IncidentOffense = names(offenseFreq[1:N]), Freq = as.integer(offenseFreq[1:N]))
arrestData$topIncident <- sapply(arrestData$incidentOffense, function(x) {
id <- 0
for (y in names(offenseFreq[1:N])) {
id <- id + 1
if (x == y) {
return(id)
}
}
return(-1)
})
arrestData <- subset(arrestData, arrestData$topIncident != -1)
# Sort y axis by median of age; sort color by frequency
arrestData$incidentOffense = with(arrestData, reorder(incidentOffense, age,
median))
arrestData$topOffense = with(arrestData, reorder(incidentOffense, topIncident,
median))
Top_15_Offenses <- arrestData$topOffense
qplot(factor(arrestData$incidentOffense), arrestData$age, main = "Relationship Between Age and Crime Type",
xlab = "Crime Type", ylab = "Age", geom = "boxplot", asp = 2, col = Top_15_Offenses) +
coord_flip()
<<<<<<< HEAD
What did you observe?:
There's indeed a weak relationship between age and criminal type. The older criminals tend to commit crimes with less violence such as shop lifting and burglary. E.G. the median of shoplifting criminals is of age 40; the younger criminals ten to commit crimes with more violence and sabotage such as trespassing, disorder, destrction of property.
However, the age range of criminals can be adolescents to relatively old poeple. Few people commit destrction of property after the age of 60.
What question are you asking?: Is the difference in arrest numbers between “black” and “white” races due to population demographics?
What is the code you use to answer it?:
#### Plot distribution of race for arrests based on neighborhood, based on code
#### by Ben Klimowski and Jon Fetter-Degges
total <- table(arrest_tab$neighborhood[arrest_tab$neighborhood != ""])
black.arrests <- arrest_tab$neighborhood[arrest_tab$race == "B"]
black <- table(black.arrests[black.arrests != ""])
white.arrests <- arrest_tab$neighborhood[arrest_tab$race == "W"]
white <- table(white.arrests[white.arrests != ""])
# 0-fill unused neighborhoods and resort
black[names(total)[!(names(total) %in% names(black))]] <- 0
black <- black[order(names(black))]
white[names(total)[!(names(total) %in% names(white))]] <- 0
white <- white[order(names(white))]
# Combine into data frame
df <- data.frame(total)
df$black <- black
df$white <- white
df <- df[order(df$Freq, df$black, df$white), ]
df$other <- df$Freq - df$black - df$white
#### Load population demographics
census_tab <- read.csv("Census_Demographics_2010.csv", stringsAsFactors = FALSE)
grepFun <- function(neighborhood) {
ns <- unlist(strsplit(neighborhood, "/"))
gr <- grep(ns[1], census_tab$CSA2010)
index <- 0
if (any(gr))
index <- gr[1] else if (length(ns) > 1) {
gr <- grep(ns[2], census_tab$CSA2010)
if (any(gr))
index <- gr[1]
}
index
}
# Look up census information for each neighborhood
df$census_ind <- apply(as.matrix(df$Var1), 1, grepFun)
# Remove neighborhoods not found in census
df <- df[df$census_ind > 0, ]
df$peraa <- census_tab$peraa10[df$census_ind]
df$perwhite <- census_tab$perwhite10[df$census_ind]
df$expected_b <- df$peraa * (df$Freq/100)
## Error: non-numeric argument to binary operator
df$expected_w <- df$perwhite * (df$Freq/100)
## Error: non-numeric argument to binary operator
#### Plot Results
rm(total, black, black.arrests, white, white.arrests, census_tab)
# Generate a plot of neighborhoods with the racial distribution of, along
# with points for the expected values.
old_par <- par(no.readonly = TRUE)
par(las = 2, mar = c(4, 9, 3, 2) + 0.1)
toPlot <- df$Freq > 600
df.bar <- barplot(t(as.matrix(df[toPlot, 3:4])), names.arg = df$Var1[toPlot],
horiz = TRUE, col = c("green", "red"), cex.names = 0.7, main = "Race of Arrests by Neighborhood")
points(x = df$expected_b[toPlot], y = df.bar, col = "blue", pch = 18)
## Error: 'x' and 'y' lengths differ
points(x = df$expected_w[toPlot] + df$black[toPlot], y = df.bar, col = "black",
pch = 18)
## Error: 'x' and 'y' lengths differ
legend(2000, 5, legend = c("black", "white", "expected black", "expected white"),
fill = c("green", "red", "blue", "black"), cex = 0.7)
<<<<<<< HEAD
<<<<<<< HEAD
Here I am making a big assumption that people tend to get arrested in the same neighborhood in which they live. Therefore the results may not apply as well in a neighborhood like “Downtown” where more people commute to the neighborhood than live there. For about half of the listed neighborhoods, black arrests are higher than mere population demographics would predict and white arrests are lower. This could be due to a variety of factors, not least of which is potential racism among the police force. The other half follow the expected results very well, and Cherry Hill even reverses the finding. Therefore I conclude that race of arrest is at leasty partially influenced by factors other than from population demographics.
What question are you asking?
We are looking for information gain between multiple attributes in the dataset. In the following code we analyse the dependencies between age, sex, race, incident location, district, neighborhood, date, and time of arrest. Finally, we plot a graph depicting the relationships between the attributes. The strength of relationship is quantified by the color of the edges, the more red the edges, the greater the strength.
What is the code you used to answer it?
library(FSelector)
arrest_tab = read.csv("BPD_Arrests.csv", stringsAsFactors = FALSE)
mod_arrest_tab <- arrest_tab
notneeded <- c("arrest", "charge", "chargeDescription", "Location.1", "arrestLocation",
"incidentOffense", "post")
mod_arrest_tab <- mod_arrest_tab[, !(names(mod_arrest_tab) %in% notneeded)]
mod_arrest_tab$arrestDate <- unlist(lapply(mod_arrest_tab$arrestDate, function(x) substr(x,
1, 2)))
mod_arrest_tab$arrestTime <- unlist(lapply(mod_arrest_tab$arrestTime, function(x) substr(x,
1, 2)))
wage <- information.gain(age ~ ., mod_arrest_tab)
wsex <- information.gain(sex ~ ., mod_arrest_tab)
wrace <- information.gain(race ~ ., mod_arrest_tab)
wincidentLocation <- information.gain(incidentLocation ~ ., mod_arrest_tab)
wdistrict <- information.gain(district ~ ., mod_arrest_tab)
wneighborhood <- information.gain(neighborhood ~ ., mod_arrest_tab)
warrestDate <- information.gain(arrestDate ~ ., mod_arrest_tab)
warrestTime <- information.gain(arrestTime ~ ., mod_arrest_tab)
require(igraph)
## Loading required package: igraph
## Warning: there is no package called 'igraph'
# 1 - Age 2 - Sex 3 - Race 4 - Incident Location 5 - District 6 -
# Neighborhood 7 - Date 8 - Time
g1 <- graph(c(1, 2, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, 1, 8, 2, 3, 2, 4, 2, 5, 2,
6, 2, 7, 2, 8, 3, 4, 3, 5, 3, 6, 3, 7, 3, 8, 4, 5, 4, 6, 4, 7, 4, 8, 5,
6, 5, 7, 5, 8, 6, 7, 6, 8, 7, 8), directed = FALSE)
## Error: could not find function "graph"
V(g1)$label <- c("Age", "Sex", "Race", "IncidentLocation", "District", "Neighborhood",
"ArrestDate", "ArrestTime")
## Error: object 'g1' not found
colors = heat.colors(1000)
# set colors for Age node
E(g1)[1]$color <- colors[999 - as.integer(200 * wage["sex", 1])]
## Error: object 'g1' not found
E(g1)[2]$color <- colors[999 - as.integer(200 * wage["race", 1])]
## Error: object 'g1' not found
E(g1)[3]$color <- colors[999 - as.integer(200 * wage["incidentLocation", 1])]
## Error: object 'g1' not found
E(g1)[4]$color <- colors[999 - as.integer(200 * wage["district", 1])]
## Error: object 'g1' not found
E(g1)[5]$color <- colors[999 - as.integer(200 * wage["neighborhood", 1])]
## Error: object 'g1' not found
E(g1)[6]$color <- colors[999 - as.integer(200 * wage["arrestDate", 1])]
## Error: object 'g1' not found
E(g1)[7]$color <- colors[999 - as.integer(200 * wage["arrestTime", 1])]
## Error: object 'g1' not found
# set colors for Sex node
E(g1)[8]$color <- colors[999 - as.integer(200 * wsex["race", 1])]
## Error: object 'g1' not found
E(g1)[9]$color <- colors[999 - as.integer(200 * wsex["incidentLocation", 1])]
## Error: object 'g1' not found
E(g1)[10]$color <- colors[999 - as.integer(200 * wsex["district", 1])]
## Error: object 'g1' not found
E(g1)[11]$color <- colors[999 - as.integer(200 * wsex["neighborhood", 1])]
## Error: object 'g1' not found
E(g1)[12]$color <- colors[999 - as.integer(200 * wsex["arrestDate", 1])]
## Error: object 'g1' not found
E(g1)[13]$color <- colors[999 - as.integer(200 * wsex["arrestTime", 1])]
## Error: object 'g1' not found
# set colors for Race node
E(g1)[14]$color <- colors[999 - as.integer(200 * wrace["incidentLocation", 1])]
## Error: object 'g1' not found
E(g1)[15]$color <- colors[999 - as.integer(200 * wrace["district", 1])]
## Error: object 'g1' not found
E(g1)[16]$color <- colors[999 - as.integer(200 * wrace["neighborhood", 1])]
## Error: object 'g1' not found
E(g1)[17]$color <- colors[999 - as.integer(200 * wrace["arrestDate", 1])]
## Error: object 'g1' not found
E(g1)[18]$color <- colors[999 - as.integer(200 * wrace["arrestTime", 1])]
## Error: object 'g1' not found
E(g1)[19]$color <- colors[999 - as.integer(200 * wincidentLocation["district",
1])]
## Error: object 'g1' not found
E(g1)[20]$color <- colors[999 - as.integer(200 * wincidentLocation["neighborhood",
1])]
## Error: object 'g1' not found
E(g1)[21]$color <- colors[999 - as.integer(200 * wincidentLocation["arrestDate",
1])]
## Error: object 'g1' not found
E(g1)[22]$color <- colors[999 - as.integer(200 * wincidentLocation["arrestTime",
1])]
## Error: object 'g1' not found
E(g1)[23]$color <- colors[999 - as.integer(200 * wdistrict["neighborhood", 1])]
## Error: object 'g1' not found
E(g1)[24]$color <- colors[999 - as.integer(200 * wdistrict["arrestDate", 1])]
## Error: object 'g1' not found
E(g1)[25]$color <- colors[999 - as.integer(200 * wdistrict["arrestTime", 1])]
## Error: object 'g1' not found
E(g1)[26]$color <- colors[999 - as.integer(200 * wneighborhood["arrestDate",
1])]
## Error: object 'g1' not found
E(g1)[27]$color <- colors[999 - as.integer(200 * wneighborhood["arrestTime",
1])]
## Error: object 'g1' not found
E(g1)[28]$color <- colors[999 - as.integer(200 * warrestDate["arrestTime", 1])]
## Error: object 'g1' not found
plot(g1)
## Error: object 'g1' not found
# 1 - Age 2 - Sex 3 - Race 4 - Date 5 - Time
g2 <- graph(c(1, 2, 1, 3, 1, 4, 1, 5, 2, 3, 2, 4, 2, 5, 3, 4, 3, 5, 4, 5), directed = FALSE)
## Error: could not find function "graph"
V(g2)$label <- c("Age", "Sex", "Race", "ArrestDate", "ArrestTime")
## Error: object 'g2' not found
# set colors for Age node
E(g2)[1]$color <- colors[999 - as.integer(60000 * wage["sex", 1])]
## Error: object 'g2' not found
E(g2)[2]$color <- colors[999 - as.integer(60000 * wage["race", 1])]
## Error: object 'g2' not found
E(g2)[3]$color <- colors[999 - as.integer(60000 * wage["arrestDate", 1])]
## Error: object 'g2' not found
E(g2)[4]$color <- colors[999 - as.integer(60000 * wage["arrestTime", 1])]
## Error: object 'g2' not found
# set colors for Sex node
E(g2)[5]$color <- colors[999 - as.integer(60000 * wage["race", 1])]
## Error: object 'g2' not found
E(g2)[6]$color <- colors[999 - as.integer(60000 * wage["arrestDate", 1])]
## Error: object 'g2' not found
E(g2)[7]$color <- colors[999 - as.integer(60000 * wage["arrestTime", 1])]
## Error: object 'g2' not found
# set colors for Race node
E(g2)[8]$color <- colors[999 - as.integer(60000 * wage["arrestDate", 1])]
## Error: object 'g2' not found
E(g2)[9]$color <- colors[999 - as.integer(60000 * wage["arrestTime", 1])]
## Error: object 'g2' not found
# set colors for Date node
E(g2)[9]$color <- colors[999 - as.integer(60000 * wage["arrestTime", 1])]
## Error: object 'g2' not found
plot(g2)
## Error: object 'g2' not found
What did you observe?
We observed a strong relationship between atrributes pertaining to location and all other attributes. This should be expected as location is a fairly unique attribute of any arrest, and therefore predictive of other attributes.
=======What question are you asking?:
Which types of arrested crimes can benefit from CCTV cameras? Is there any relationship between the number of CCTV cameras and the number of arrests on specific incident offense? In addition, what is the difference of the performance on sex?
=======
Zebao Gao
What question are you asking?:
What are the major types of criminals for males/females in comparison?
>>>>>>> 2574bd065679b3b24b0d520209815fbcdab1d612
What is the code you use to answer it?:
<<<<<<< HEAD
narcotics <- c("87-Narcotics", "87O-Narcotics (Outside)", "97-Search & Seizure")
violent <- c("1A-Murder", "2A-Rape (Force)", "2B-Rape (Attempt)", "3AF-Robb Hwy-Firearm",
"3AJF-Robb Carjack-Firearm", "3AJK-Robb Carjack-Knife", "3AK-Robb Hwy-Knife",
"3AO-Robb Hwy-Other Wpn", "3CF-Robb Comm-Firearm", "3CK-Robb Comm-Knife",
"3CO-Robb Comm-Other Wpn", "3EF-Robb Gas Station-Firearm", "3EK-Robb Gas Station-Knife",
"3GF-Robb Conv Store-Firearm", "3GK-Robb Conv Store-Knife", "3GO-Robb Conv Store-Other Wpn",
"3JF-Robb Residence-Firearm", "3JK-Robb Residence-Knife", "3JO-Robb Residence-Other Wpn",
"3LF-Robb Bank-Firearm", "3FO-RObb Bank-Other Wpn", "3NF-Robb Misc-Firearm",
"3NK-Robb Misc-Knife", "3NO-Robb Misc-Other Wpn", "4A-Agg. Asslt.- Gun",
"4B-Agg. Asslt.- Cut", "4C-Agg. Asslt.- Oth.", "4D-Agg. Asslt.- Hand")
IncidentViaCCTV <- function(offenses, sexual) {
subsetvio = subset(arrest_tab, arrest_tab$incidentOffense %in% offenses)
if (sexual == "M" || sexual == "F")
subsetsex = subset(subsetvio, subsetvio$sex == sexual) else subsetsex = subsetvio
latRange = range(arrest_tab$lat, na.rm = TRUE)
lonRange = range(arrest_tab$lon, na.rm = TRUE)
latGrid = seq(min(latRange), max(latRange), len = 50)
lonGrid = seq(min(lonRange), max(lonRange), len = 50)
latFac = as.numeric(cut(subsetsex$lat, breaks = latGrid))
lonFac = as.numeric(cut(subsetsex$lon, breaks = lonGrid))
gridFac = (latFac - 1) * length(latGrid) + (lonFac - 1)
latFac = as.numeric(cut(cctv_tab$lat, breaks = latGrid))
lonFac = as.numeric(cut(cctv_tab$lon, breaks = lonGrid))
cctvGridFac = (latFac - 1) * length(latGrid) + (lonFac - 1)
arrestTab = table(gridFac)
cctvTab = table(cctvGridFac)
m = match(names(cctvTab), names(arrestTab))
plot(arrestTab[m] ~ factor(cctvTab))
}
IncidentViaCCTV(narcotics, "M")
IncidentViaCCTV(narcotics, "F")
IncidentViaCCTV(violent, "M")
IncidentViaCCTV(violent, "F")
IncidentViaCCTV(c("4E-Common Assault"), "M")
IncidentViaCCTV(c("4E-Common Assault"), "F")
What did you observe?:
From previous "challenge" example, we know that the relationship between the number of CCTV cameras and the number of arrests is not obvious. However, I guess that there should be some certain incident offenses that woule benefit from CCTV caremas. (Otherwise, we can give up CCTV caremas.)
First I tested on narcotics offenses. It is not influensed by the number of cameras. I guess it is because most people have drug in the houses, such that would not be photoed by cameras.
Second I tested on violent crimes. Ignoing the small cases, the number of arrested crimes roughly increases as the number of caremas, especially on female. Obviously, CCTV cameras can catch violent crimes.
Furthermore, I tested on the certain offenses, "4E-Common Assault". The relationship is more obvious.
So using this function, we can know the which types of crimes would benefit from CCTV caremas.
>>>>>>> f520a6901b5f9c89f7fa77a941026b208c6bfe34
=======
Phil Nguyen
Question: What is the distribution of arrest time for each neighborhood?
Code I use:
# convert arrest time to numeric, e.g. 11:45 -> 11.75
time.to.num <- function(s) {
x <- as.numeric(gsub(":", "", s))
(x%/%100) + (x%%100)/60
}
# abbreviate district names
abbrev <- function(s) {
gsub("CENTRAL", "C", gsub("EASTERN", "E", gsub("WESTERN", "W", gsub("NORTHERN",
"N", gsub("NORTHWESTERN", "NW", gsub("SOUTHEASTERN", "SE", gsub("SOUTHERN",
"S", gsub("SOUTHWESTERN", "SW", gsub("NORTHEASTERN", "NE", s)))))))))
}
# filter out empty districts
dat <- arrest_tab[arrest_tab$district != "", ]
plot(time.to.num(dat$arrestTime) ~ factor(abbrev(dat$district)), xlab = "Neighborhood",
ylab = "Arrest Time")
# earliest and latest arrest time median
median(time.to.num(dat$arrestTime[dat$district == "NORTHWESTERN"]))
## [1] 17.07
median(time.to.num(dat$arrestTime[dat$district == "CENTRAL"]))
## [1] 14.33
What I observed: Overall, more than half of the arrests happen between 10AM and 8PM.
There is no siginificant difference between arrest time in different neighborhoods,
although arrests happen slightly earlier in central and the southern areas compared to northern areas.
>>>>>>> 0e04f064bb355d87894fd2913c6d15a069b0daf6
=======
# get arrest data related to sex and incidenOffense, and remove unknown
# offense types
arrest_data <- subset(arrest_tab, select = c(sex, incidentOffense))
arrest_data <- subset(arrest_data, !(incidentOffense %in% c("79-Other", "Unknown Offense")))
# calculate frequence of criminals by males or females for each criminal
# type
arrest_table <- table(arrest_data$incidentOffense, arrest_data$sex)
arrest_frame <- as.data.frame.table(arrest_table)
arrest_frame_m <- subset(arrest_frame, arrest_frame$Var2 == "M")
arrest_frame_f <- subset(arrest_frame, arrest_frame$Var2 == "F")
arrest_frame_all <- data.frame(incidentOffense = arrest_frame_m$Var1, M = arrest_frame_m$Freq,
F = arrest_frame_f$Freq)
# remove criminal types occured for very few times
arrest_frame_all <- subset(arrest_frame_all, arrest_frame_all$M > 2 & arrest_frame_all$F >
2)
arrest_frame_all$MRatio = 100 * arrest_frame_all$M/(arrest_frame_all$M + arrest_frame_all$F)
arrest_frame_all$FRatio = 100 * arrest_frame_all$F/(arrest_frame_all$M + arrest_frame_all$F)
# remove unusable data
rm(arrest_table, arrest_frame, arrest_frame_m, arrest_frame_f)
arrest_frame_all <- arrest_frame_all[order(arrest_frame_all$MRatio), ]
par(las = 2, mar = c(4, 9, 3, 2) + 0.1)
barplot(t(as.matrix(arrest_frame_all[arrest_frame_all$MRatio > 87, 4:5])), horiz = TRUE,
col = c("blue", "red"), cex.names = 0.7, main = "Major Criminal Types for Males",
names.arg = arrest_frame_all$incidentOffense[arrest_frame_all$MRatio > 87])
arrest_frame_all <- arrest_frame_all[order(arrest_frame_all$FRatio), ]
barplot(t(as.matrix(arrest_frame_all[arrest_frame_all$FRatio > 20, 5:4])), horiz = TRUE,
col = c("red", "blue"), cex.names = 0.7, main = "Major Criminal for Females",
names.arg = arrest_frame_all$incidentOffense[arrest_frame_all$FRatio > 20])
rm(arrest_frame_all)
What did you observe?:
In the two charts, the criminal types which are mainly committed by males/females are listed. The blue/red parts stand for the percentage of cases committed by males/females.
There's a difference between the major types of criminals for males and females. Although males take a larger percentage in most of the criminal types, most of the major types are still kind of voilent. The characteristics of criminal types for females are more related to sex, children and disabilites.
>>>>>>> 2574bd065679b3b24b0d520209815fbcdab1d612